ABSTRACT

The study reported in this paper addresses three issues related to
phonetic classification: 1) whether it is important to choose an
appropriate signal representation, 2) whether there are any
advantages in extracting acoustic attributes over directly using the
spectral information, and 3) whether it is advantageous to introduce
an intermediate set of linguistic units, i.e. distinctive features.
To restrict the scope of our study, we focused on 16 vowels in
American English, and investigated classification performance using
an artificial neural network with nearly 22,000 vowel tokens from 550
speakers excised from the TIMIT corpus. Our results indicate that 1)
the combined outputs of Seneff’s auditory model outperform five other
representations with both undegraded and noisy speech, 2) acoustic
attributes give similar performance to raw spectral information, but
at potentially considerable computational savings, and 3) the
distinctive feature representation gives similar performance to
direct vowel classification, but potentially offers a more flexible
mechanism for describing context dependency.


INTRODUCTION 

The overall goal of our study is to explore the use of distinctive
features for automatic speech recognition. Distinctive features are a
set of properties that linguists use to classify phonemes [1,13]. More
precisely, a feature is a minimal unit which distinguishes two
maximally-close phonemes; for example, /b/ and /p/ are distinguished
by the feature [voice]. Sounds are more often confused in relation to
the number of features they share, and it is believed that around 15
to 20 distinctive features are sufficient to account for phonemes in
all languages of the world. Moreover, the values of these features,
such as [+high] or [-round], correspond directly to contextual
variability and coarticulatory phenomena, and often manifest
themselves as well-defined acoustic correlates in the speech signal
[3]. The compactness and descriptive power of distinctive features may
enable us to describe contextual influence more parsimoniously, and
thus to make more effective use of available training data.

¹This research was supported by DARPA under Contract N00014-

In order to fully assess the utility of this linguistically
well-motivated set of units, several important issues must be
addressed. First, is there a particular spectral representation that
is preferred over others? Second, should we use the spectral
representation directly for phoneme/feature classification, or should
we instead extract and use acoustic correlates? Finally, does the
introduction of an intermediate feature-based representation between
the signal and the lexicon offer performance advantages? We have
chosen to answer these questions by performing a set of phoneme
classification experiments in which conditional variables are
systematically varied. The usefulness of one condition over another is
inferred from the performance of the classifier.

In this paper, we will report our study on the three questions that
we posed earlier. First, we will report our comparative study on
signal representations. Based on these results, we will then describe
our experiments and results on acoustic attribute extraction, and the
use of distinctive features. Finally, we will discuss the implications
and make some tentative conclusions.

TASK  AND  CORPUS 

The task chosen for our experiments is the classification² of vowels
in American English. The corpus consists of 13 monophthongs
/i, ɪ, e, ɛ, æ, ɑ, ɔ, ʌ, o, u, ʊ, ü and ɝ/ and 3 diphthongs
/ɑy, ɔy, ɑw/. The vowels are excised from the acoustic-phonetically
compact portion of the TIMIT corpus [6], with no restrictions imposed
on the phonetic contexts of the vowels. For the signal representation
study, experiments are based on the task of classifying all 16 vowels.
However, the dynamic nature of the diphthongs may render distinctive
feature specification ambiguous. As a result, we excluded the
diphthongs in our investigation involving distinctive features, and
the sizes of the training and test sets were reduced correspondingly.
The size and contents of the two corpora are summarized in Table 1.

²It is a classification task in that the left and right boundaries of
the vowel token are known through a hand-labelling procedure, and the
classifier is only asked to determine the most likely label.
Corpus  Training Speakers (M/F)  Testing Speakers (M/F)  Training Tokens  Testing Tokens
I       500 (357/143)            50 (33/17)              20,000           2,000
II      500 (357/143)            50 (33/17)              19,000           1,700

Table 1: Corpus I consists of 16 monophthong and diphthong vowels.
It is used for investigation of signal representation. Corpus II is a
subset of Corpus I. It consists of the monophthongs only, and is used
for investigation of distinctive features.


For  the  experiments  dealing  with  distinctive  features,  we 
characterized  the  13  vowels  in  terms  of  6  distinctive  features, 
following  the  conventions  set  forth  by  others  [13].  The  feature 
values  for  these  vowels  are  summarized  in  Table  2. 

The classifier for our experiments was selected with the following
considerations. First, to facilitate comparisons of different results,
we restrict ourselves to using the same classifier for all
experiments. Second, the classifier must be flexible in that it does
not make assumptions about specific statistical distributions or
distance metrics, since different signal representations may have
different characteristics. Based on these two constraints, we have
chosen to use the multi-layer perceptron (MLP) [7]. In our signal
representation experiments, the network contains 16 output units, one
for each of the 16 vowels. The input layer contains 120 units, 40
units each representing the initial, middle, and final third of the
vowel segment. For the experiments involving acoustic attributes and
distinctive features, the input layer may be the spectral vectors, a
set of acoustic attributes, or the distinctive features, and the
output layer may be the vowel labels or the distinctive features, as
will be described later.

All networks have a single hidden layer with 32 hidden units. This
and other parameters had previously been adapted for better learning
capabilities. In addition, input normalization and center
initialization have been used [8].

SIGNAL  REPRESENTATION 

Review  of  Past  Work 

Several experiments comparing signal representations have been
reported in the past. Mermelstein and Davis [10] compared the
mel-frequency cepstral coefficients (MFCC) with four other more
conventional representations. They found that a set of 10 MFCC
resulted in the best performance, suggesting that the mel-frequency
cepstra possess significant advantages over the other representations.
Hunt and Lefebvre [4] compared the performance of their
psychoacoustically-motivated auditory model with that of a 20-channel
mel-cepstrum. They found that the auditory model gave the highest
performance under all conditions, and was least affected by changes in
loudness, interfering noise and spectral shaping distortions. Later,
they [5] conducted another comparison with the auditory model output,
the mel-scale cepstrum with various weighting schemes, cepstrum
coefficients augmented by the Δ-cepstrum coefficients, and the IMELDA
representation, which combined between-class covariance information
with within-class covariance information of the mel-scale filter bank
outputs to generate a set of linear discriminant functions. The IMELDA
outperformed all other representations.
Table 2: The set of distinctive features used to characterize 13
vowels

These studies generally show that the choice of parametric
representations is very important to recognition performance, and
auditory-based representations generally yield better performance
than more conventional representations. In the comparison of the
psychoacoustically-motivated auditory model with MFCC, however,
different methods of analysis led to different results. Therefore, it
will be interesting to compare outputs of an auditory model with the
computationally simpler mel-based representation when the
experimental conditions are more carefully controlled.

Experimental  Procedure 

Our study compares six acoustic representations [9], using the MLP
classifier. Three of the representations are obtained from the
auditory model proposed by Seneff [12]. Two representations are based
on mel-frequency, which has gained popularity in the speech
recognition community. The remaining one is based on the conventional
Fourier transform. Attention is focused upon the relative
classification performance of the representations, the effects of
varying the amount of training data, and the tolerance of the
different representations to additive white noise.

For each representation, the speech signal is sampled at 16 kHz and a
40-dimensional spectral vector is computed once every 5 ms, covering
a frequency range of slightly over 6 kHz. To capture the dynamic
characteristics of vowel articulation, three feature vectors,
representing the average spectra for the initial, middle, and final
third of every vowel token, are determined for each representation. A
120-dimensional feature vector for the MLP is then obtained by
appending the three average vectors.
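
The construction of the 120-dimensional input can be sketched as
follows. This is a minimal illustration rather than the code used in
the study: the function name, the toy data, and the way an uneven
number of frames is split into thirds are all assumptions, since the
text does not specify them.

```python
def thirds_feature_vector(frames):
    """Average the frames in the initial, middle, and final third of a
    token and concatenate the three averages into one vector."""
    n = len(frames)
    if n < 3:
        raise ValueError("need at least three frames to form thirds")
    dim = len(frames[0])
    # Boundaries of the three thirds (assumed rounding for uneven splits).
    bounds = [0, n // 3, (2 * n) // 3, n]
    vector = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        segment = frames[start:end]
        # Per-dimension mean over the frames of this third.
        vector.extend(sum(f[d] for f in segment) / len(segment)
                      for d in range(dim))
    return vector


# Toy token: 9 frames (45 ms at one frame per 5 ms), 40 channels each.
token = [[float(t)] * 40 for t in range(9)]
features = thirds_feature_vector(token)
```

With 40-dimensional frames the result always has 3 x 40 = 120 values,
regardless of the duration of the token.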

Seneff’s auditory model (SAM) produces two outputs: the mean-rate
response (MR), which corresponds to the mean probability of firing on
the auditory nerve, and the synchrony response (SR), which measures
the extent of dominance at the critical band filters’ characteristic
frequencies. Each of these responses is a 40-dimensional spectral
vector. Since the mean-rate and synchrony responses were intended to
encode complementary acoustic information in the signal, a
representation combining the two is also included. It is formed by
appending the first 20 principal components of the MR and of the SR to
give another 40-dimensional vector (SAM-PC).

To obtain the mel-frequency spectral and cepstral coefficients (MFSC
and MFCC, respectively), the signal is pre-emphasized via first
differencing and windowed by a 25.6 ms Hamming window. A 256-point
discrete Fourier transform (DFT) is then computed from the windowed
waveform. Following Mermelstein and Davis [10], these Fourier
transform coefficients are then squared, and the resultant magnitude
squared spectrum is passed through the mel-frequency triangular
filter banks described below. The log energy outputs (in decibels) of
the filters, $X_k$, $k = 1, 2, \ldots, 40$, collectively form the
40-dimensional MFSC vector. Carrying out a cosine transform [10] on
the MFSC according to the following equation yields the MFCC’s,
$Y_i$, $i = 1, 2, \ldots, 40$.

$$Y_i = \sum_{k=1}^{40} X_k \cos\left[ i \left( k - \frac{1}{2} \right) \frac{\pi}{40} \right]$$

The lowest cepstrum coefficient, $Y_0$, is excluded to reduce
sensitivity to overall loudness.
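
As an illustration, the cosine transform can be written directly from
the sum over the 40 filter outputs. The sketch below assumes the form
Y_i = sum over k of X_k cos[i (k - 1/2) pi / 40] given in [10]; the
function name and the two toy inputs are ours.

```python
import math


def mfsc_to_mfcc(mfsc):
    """Cosine transform Y_i = sum_k X_k cos[i (k - 1/2) pi / N],
    for i = 1..N, where N is the number of filter outputs."""
    n = len(mfsc)
    return [
        sum(x * math.cos(i * (k - 0.5) * math.pi / n)
            for k, x in enumerate(mfsc, start=1))
        for i in range(1, n + 1)
    ]


# A flat log spectrum (only the overall level is non-zero).
flat = mfsc_to_mfcc([3.0] * 40)

# A log spectrum shaped like the third cosine basis function.
ripple = mfsc_to_mfcc(
    [math.cos(3 * (k - 0.5) * math.pi / 40) for k in range(1, 41)]
)
```

For the flat input every coefficient with i >= 1 is zero up to
rounding, which shows why leaving out the lowest coefficient removes
the dependence on overall level. For the second input only the third
coefficient is non-zero.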

The mel-frequency triangular filter banks are designed to resemble
the critical band filter bank of SAM. The filter bank consists of 40
overlapping triangular filters spanning the frequency region from 130
to 6400 Hz. Thirteen triangles are evenly spread on a linear frequency
scale from 130 Hz to 1 kHz, and the remaining 27 triangles are evenly
distributed on a logarithmic frequency scale from 1 kHz to 6.4 kHz,
where each subsequent filter is centered at 1.07 times the previous
filter’s center frequency. The area of each triangle is normalized to
unit magnitude.
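
One layout that is consistent with these figures is sketched below.
It is an assumption, not a description of the actual filter design:
we suppose that each triangle extends from the center of its lower
neighbor to the center of its upper neighbor, that the thirteenth
center falls at 1 kHz, and that the upper edge of the last triangle
falls at 6.4 kHz. Under these assumptions the ratio between
successive log-spaced centers comes out close to the stated 1.07.

```python
def filter_bank_points(f_low=130.0, f_knee=1000.0, f_high=6400.0,
                       n_linear=13, n_log=27):
    """Return the 42 edge/center frequencies (Hz) of 40 overlapping
    triangles and the ratio between successive log-spaced centers."""
    step = (f_knee - f_low) / n_linear
    # 14 points from 130 Hz to 1 kHz on a linear scale.
    points = [f_low + m * step for m in range(n_linear + 1)]
    # 28 further points up to 6.4 kHz on a logarithmic scale.
    ratio = (f_high / f_knee) ** (1.0 / (n_log + 1))
    points += [f_knee * ratio ** m for m in range(1, n_log + 2)]
    return points, ratio


def triangle_weight(f, lo, center, hi):
    """Weight of a unit-area triangular filter at frequency f (Hz)."""
    peak = 2.0 / (hi - lo)  # height that makes the area equal to 1
    if lo < f <= center:
        return peak * (f - lo) / (center - lo)
    if center < f < hi:
        return peak * (hi - f) / (hi - center)
    return 0.0


points, ratio = filter_bank_points()
centers = points[1:-1]  # the 40 center frequencies
```

Filter number j (counting from 1) then uses points[j - 1], points[j],
and points[j + 1] as its lower edge, center, and upper edge.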

The Fourier transform representation is obtained by computing a
256-point DFT from a smoothed cepstrum, and then downsampling to 40
points.

One of the experiments investigates the relative immunity of each
representation to additive white noise. The noisy test tokens are
constructed by adding white noise to the signal to achieve a peak
signal-to-noise ratio (SNR) of 20 dB, which corresponds to an SNR
(computed with average energies) of slightly below 10 dB.
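
The general idea of corrupting a token at a prescribed SNR can be
sketched as follows. Note the assumptions: the sketch scales Gaussian
white noise to reach a target SNR computed from average energies,
because the exact definition of the peak SNR is not given here, and
the function names, the seed, and the sinusoidal stand-in for a
speech token are all hypothetical.

```python
import math
import random


def power(x):
    """Average energy per sample."""
    return sum(v * v for v in x) / len(x)


def add_white_noise(signal, snr_db, seed=0):
    """Add Gaussian white noise scaled so that the average-energy SNR
    of the result equals snr_db."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    gain = math.sqrt(power(signal) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(signal, noise)]


# Stand-in for a vowel token: 100 ms of a 440 Hz tone sampled at 16 kHz.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noisy = add_white_noise(clean, snr_db=10.0)

# Check the SNR actually achieved.
residual = [n - c for n, c in zip(noisy, clean)]
measured_snr_db = 10.0 * math.log10(power(clean) / power(residual))
```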

Results 

For each acoustic representation, four separate experiments were
conducted using 2,000, 4,000, 8,000, and finally 20,000 training
tokens. In general, performance improves as more training tokens are
utilized. This is illustrated in Figure 1, in which accuracies on
training and testing data are plotted as a function of the number of
training tokens for SAM-PC and MFCC. As the size of the training set
increases, so does the classification accuracy on testing data. This
is accompanied by a corresponding decrease in performance on training
data. At 20,000 training tokens, the difference between training and
testing set performance is about 5% for both representations.


Figure 1: Effect of increasing training data on testing accuracies
(horizontal axis: number of training tokens)

To investigate the relative immunity of the various acoustic
representations to noise degradation, we determine the classification
accuracy of the noise-corrupted test set on the networks after they
have been fully trained on clean tokens. The results with noisy test
speech are shown in Figure 2, together with the corresponding results
on the clean test set. The decrease in accuracy ranges from about 12%
(for the combined auditory model) to almost 25% (for the DFT).


Figure 2: Performance on noisy and clean speech (horizontal axis:
acoustic representation)

ACOUSTIC  ATTRIBUTES  AND 
DISTINCTIVE  FEATURES 

Our experiments were again conducted using an MLP classifier for
speaker-independent vowel classification. Three experimental
parameters were systematically varied, resulting in six different
conditions, as depicted in Figure 3. These three parameters specify
whether the acoustic attributes are extracted, whether an
intermediate distinctive feature representation is used, and how the
feature values are combined for vowel classification. In some
conditions (cf. conditions A, E, and F), the spectral vectors from
the mean-rate response were used directly, whereas in others (cf.
conditions B, C, and D), each vowel token was represented by a set of
automatically-extracted acoustic attributes. In still other
conditions (cf. conditions C, D, E, and F), an intermediate
representation based on distinctive features was introduced. The
feature values were either used directly for vowel identification
through one-bit quantization (i.e. transforming them into a binary
representation) and table look-up (cf. conditions C and E), or were
fed to another MLP for further classification (cf. conditions D and
F). Taken as a whole, these experiments will enable us to answer the
questions that we posed earlier. Thus, for example, we can assess the
usefulness of extracting acoustic attributes by comparing the
classification performance of conditions A versus B and D versus F.


Figure  3:  Experimental  paradigm  comparing  direct  phonetic 
classification  with  attribute  extraction,  and  the  use  of  linguistic 
features.  The  mean  rate  response  is  chosen  to  be  the  signal. 

Acoustic  Representation 

Each vowel token is characterized either directly by a set of
spectral coefficients, or indirectly by a set of automatically
derived acoustic attributes. In either case, three average vectors
are used to characterize the left, middle, and right thirds of the
token, in order to implicitly capture the context dependency of vowel
articulation.

Spectral Representation. Comparative experiments described in the
previous section indicate that representations from Seneff’s auditory
model result in performance superior to others. While the combined
mean rate and synchrony representation (SAM-PC) gave the best
performance, it may not be an appropriate choice for our present
work, since the heterogeneous nature of the representation poses
difficulties in acoustic attribute extraction. As a result, we have
selected the next best representation, the mean rate response (MR).

Acoustic Attributes. The attributes that we extract are intended to
correspond to the acoustic correlates of distinctive features.
However, we do not as yet possess a full understanding of how these
correlates can be extracted robustly. Besides, we must somehow
capture the variabilities of these features across speakers and
phonetic environments. For these reasons, we have adopted a more
statistical and data-driven approach. In this approach, a general
property detector is proposed, and the specific numerical values of
the free parameters are determined from training data using an
optimization criterion [14]. In our case, the general property
detectors chosen are the spectral center of gravity and its
amplitude. This class of detectors may carry formant information, and
can be easily computed from a given spectral representation.
Specifically, we used the mean rate response, under the assumption
that the optimal signal representation for phonetic classification
should also be the most suitable for defining and quantifying
acoustic attributes, from which distinctive features can eventually
be extracted.

The process of attribute extraction is as follows. First, the
spectrum is shifted down linearly on the Bark scale by the median
pitch for speaker normalization. For each distinctive feature, the
training tokens are divided into two classes, [+feature] and
[-feature]. The lower and upper frequency edges (or “free
parameters”) of the spectral center of gravity are chosen so that the
resultant measurement maximizes Fisher’s Discriminant Criterion (FDC)
between the classes [+feature] and [-feature] [2].
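
The search for the band edges can be illustrated with the sketch
below. Several details are assumptions made only for the sake of a
runnable example: the center of gravity is measured in channel-index
units, its amplitude is taken to be the mean value over the band, the
FDC is taken in its usual two-class form (squared difference of class
means over the sum of class variances), the search is exhaustive, and
pitch normalization is omitted. The toy spectra are hypothetical.

```python
def center_of_gravity(spectrum, lo, hi):
    """Centroid (in channel-index units) and mean amplitude of the
    band spectrum[lo..hi], both edges inclusive."""
    band = spectrum[lo:hi + 1]
    total = sum(band)
    if total <= 0:
        return (lo + hi) / 2.0, 0.0
    cog = sum(i * a for i, a in enumerate(band, start=lo)) / total
    return cog, total / len(band)


def fisher_criterion(pos, neg):
    """Squared difference of class means over the sum of class variances."""
    mean_p, mean_n = sum(pos) / len(pos), sum(neg) / len(neg)
    var_p = sum((x - mean_p) ** 2 for x in pos) / len(pos)
    var_n = sum((x - mean_n) ** 2 for x in neg) / len(neg)
    return (mean_p - mean_n) ** 2 / (var_p + var_n + 1e-12)


def best_band(pos_spectra, neg_spectra):
    """Exhaustively search band edges (lo, hi) maximizing the criterion."""
    n = len(pos_spectra[0])
    best_score, best_edges = -1.0, None
    for lo in range(n - 1):
        for hi in range(lo + 1, n):
            pos = [center_of_gravity(s, lo, hi)[0] for s in pos_spectra]
            neg = [center_of_gravity(s, lo, hi)[0] for s in neg_spectra]
            score = fisher_criterion(pos, neg)
            if score > best_score:
                best_score, best_edges = score, (lo, hi)
    return best_score, best_edges


# Toy 8-channel spectra: [+feature] tokens peak low, [-feature] peak high.
pos_tokens = [[1, 1, 6, 5, 1, 1, 1, 1], [1, 2, 5, 6, 1, 1, 1, 1]]
neg_tokens = [[1, 1, 1, 1, 2, 6, 5, 1], [1, 1, 1, 1, 1, 5, 6, 2]]
score, (lo, hi) = best_band(pos_tokens, neg_tokens)
```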

For the features [back], [tense], [round], and [retroflex], only one
attribute per feature is used. For [high] and [low], we found it
necessary to include two attributes per feature, using the two sets
of optimized free parameters giving the highest and the second
highest FDC. These 8 frequency values, together with their
corresponding amplitudes, make up 16 attributes for each third of a
vowel token. Therefore, the overall effect of performing acoustic
attribute extraction is to reduce the input dimensions from 120 to
48.

Results 

The results of our experiments are summarized in Figure 4, plotted as
classification accuracy for each of the conditions shown in Figure 3.
The values in this figure represent the average of six iterations;
performance variation among iterations of the same experiment amounts
to about 1%.

By comparing the results for conditions A and B, we see that there is
no statistically significant difference in performance as one
replaces the spectral representation by the acoustic attributes. This
result is further corroborated by the comparison between conditions C
and E, and D and F.

Figure 4: Performance of the six classification pathways in our
experimental paradigm

Figure 4 shows a significant deterioration in performance when one
simply maps the feature values to a binary representation for table
look-up (i.e., comparing conditions A to E and B to C). We can also
examine the accuracies of binary feature assignment for each feature,
and the results are shown in Figure 5. The accuracy for individual
features ranges from 87% to 98%, and there is again little difference
between the results using the mean rate response and using acoustic
attributes. It is perhaps not surprising that table look-up using
binary feature values results in lower performance, since it would
require that all of the features be identified correctly.
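
The effect can be made concrete with a small sketch of one-bit
quantization followed by table look-up. Everything in it is
illustrative: the three-feature codebook and its labels are
hypothetical and are not the feature table of this study, the 0.5
threshold assumes network outputs between 0 and 1, and the joint
accuracy calculation assumes that feature errors are independent,
which the experiments do not establish.

```python
# Hypothetical codebook for illustration only (not the paper's Table 2).
CODEBOOK = {
    (1, 0, 0): "V1",
    (0, 1, 0): "V2",
    (0, 0, 1): "V3",
    (1, 1, 0): "V4",
}


def quantize(outputs, threshold=0.5):
    """One-bit quantization of the feature network outputs."""
    return tuple(1 if o >= threshold else 0 for o in outputs)


def lookup(outputs):
    """Return the label whose feature bundle matches exactly, else None."""
    return CODEBOOK.get(quantize(outputs))


def joint_accuracy(per_feature_accuracies):
    """Probability that every feature is correct, assuming independence."""
    p = 1.0
    for a in per_feature_accuracies:
        p *= a
    return p


hit = lookup([0.9, 0.2, 0.1])    # matches the bundle (1, 0, 0)
miss = lookup([0.9, 0.9, 0.9])   # (1, 1, 1) is not in the codebook
all_six_correct = joint_accuracy([0.9] * 6)
```

Under the independence assumption, six features that are each
correct 90% of the time would all be correct only about 53% of the
time, and a single wrong bit either selects the wrong entry or
matches no entry at all.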


Figure  5:  Distinctive  features  mapping  accuracies  for  the  mean 
rate  response  and  acoustic  attributes 

However, when we use a second MLP to classify the features into
vowels, a considerable improvement (> 4%) is obtained, to the extent
that the resulting accuracy is again comparable to other conditions
(cf. conditions A and F, and conditions B and D).


DISCUSSION 

Our results indicate that, on a fully trained network,
representations based on auditory modelling consistently outperform
other representations. The best among the three auditory-based
representations, SAM-PC, achieved a top-choice accuracy of 66%.

The MFSC and MFCC representations performed worse than the
auditory-based representations and slightly better than the DFT. At
first glance, it may appear that the discrepancies are small, since
the error rate is only increased slightly (from 33% to 38%). However,
previous research on human and machine identification of vowels,
independent of context, has shown that the best performance attained
is around 65% [11]. Viewed in this light, the difference in
performance becomes much more significant. One legitimate concern may
be that principal component analysis has been applied to SAM-PC, but
not to MFCC. However, the cosine transform used in obtaining the MFCC
performs a similar function to principal component analysis.
Experiments have been conducted using the first 40 principal
components of MFCC, and the classification accuracy (61.3%) shows
that principal component analysis has no statistically significant
effect on performance. It may also be argued that too many MFCC
coefficients have been used, and this may degrade its performance.
But further experiments have shown that classification accuracy
increases with the number of MFCC used, and using 40 MFCC yielded the
highest performance. Therefore, we may tentatively conclude that
auditory-based signal representations are preferred, at least within
the bounds of our experimental conditions.

Performance on noisy speech for the various representations follows
the trend of that on clean speech, with the exception that the range
of accuracy increased substantially. The degradation of the SAM
representations was least severe, about 12%, whereas the
mel-representations showed a drop of 17%. The DFT is most affected by
noise, and its performance degraded by over 24%. We believe that
training with clean speech and testing with noisy speech is a fair
experimental paradigm since the noise level of test speech is often
unknown in practice, but the environment for recording training
speech can always be controlled.

Our investigation on the use of acoustic attributes is partly
motivated by the belief that these attributes can enhance phonetic
contrasts by focusing upon relevant information in the signal,
thereby leading to improved phonetic classification performance when
only a finite amount of training data is available. The acoustic
attributes that we have chosen are intuitively reasonable and easy to
measure. But they are by no means optimum, since we did not set out
to design the best set of attributes for enhancing vowel contrasts.
Nevertheless, their use has led to performance comparable to the
direct use of spectral information. With an improved understanding of
the relationship between distinctive features and their acoustic
correlates, and a little more care in the design and extraction of
these attributes, it is conceivable that better classification
accuracy can be obtained.

Pathway      A     B     C     D     E     F
Connections  4288  1984  1760  2400  4064  4704

Table 3: Sizes of the networks in our experimental paradigm.

Another advantage of using acoustic attributes is a saving on
run-time computations through reduction of input dimensions. Table 3
compares the total number of connections in the one or more MLPs
within each condition in our experimental paradigm. With a small
amount of preprocessing, the use of acoustic attributes can save
about half of the computations required by the direct use of spectral
representation.
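
The connection counts in Table 3 are consistent with the accounting
sketched below. The formula is our inference and is not stated in the
text: every input, plus one bias term, connects to each of the 32
hidden units, and every hidden unit connects to each output unit,
with no bias on the outputs. The constant and function names are
ours.

```python
HIDDEN = 32  # hidden units in every network


def mlp_connections(n_in, n_out, hidden=HIDDEN):
    """(inputs + bias) x hidden, plus hidden x outputs (assumed formula)."""
    return (n_in + 1) * hidden + hidden * n_out


SPECTRAL, ATTRIBUTES, FEATURES, VOWELS = 120, 48, 6, 13

pathways = {
    "A": mlp_connections(SPECTRAL, VOWELS),
    "B": mlp_connections(ATTRIBUTES, VOWELS),
    "C": mlp_connections(ATTRIBUTES, FEATURES),
    "D": mlp_connections(ATTRIBUTES, FEATURES)
         + mlp_connections(FEATURES, VOWELS),
    "E": mlp_connections(SPECTRAL, FEATURES),
    "F": mlp_connections(SPECTRAL, FEATURES)
         + mlp_connections(FEATURES, VOWELS),
}

savings_b_over_a = 1.0 - pathways["B"] / pathways["A"]
```

Pathways D and F include the second MLP that maps the six features to
the 13 vowel labels. Comparing B with A gives a reduction of roughly
54%, in line with the saving of about half mentioned above.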

One potential source of discrepancy in our experiments has to do with
pitch normalization, which was not performed on the mean-rate
response. A pitch-normalized spectral center of gravity measure was
nevertheless used to extract acoustic attributes, since it can
eliminate singularities that complicate the search for a maximum FDC
value in the optimization process. However, this advantage is
sometimes obtained at the expense of a lower FDC value, thus leading
to poorer performance. While we do not feel that pitch normalization
has any significant effect on the outcome of our experiments, further
experiments are clearly necessary.

To introduce a set of linguistically motivated distinctive features
as an intermediate representation for phonetic classification, we
first transform the acoustic representations into a set of features,
and then map the features into vowel labels. While one may argue that
such a two-step process is inherently sub-optimal, we nevertheless
were able to obtain comparable performance, corroborating the
findings of Leung [7]. Such an intermediate representation can offer
us a great deal of flexibility in describing contextual variations.
For example, all vowels sharing the feature [+round] will affect the
acoustic properties of neighboring consonants in predictable ways,
which can be described more parsimoniously. By describing context
dependencies this way, we can also make use of training data more
effectively by collapsing all available data along a given feature
dimension.

Figure 5 shows that performance on some features is worse than
others, presumably due to inadequacies in the attributes that we use.
For example, performance on the feature [tense] should be improved by
incorporating segment duration as an additional attribute. When a
second classifier is used to map the feature values into vowel
labels, a 4-5% accuracy increase is realized such that the
performance is again comparable to cases without this intermediate
feature representation. This result suggests that the
acoustic-phonetic information is preserved in the aggregate of the
features, and that the subsequent performance recovery may be a
consequence of the redundant nature of distinctive features, as well
as the ability of the second classifier to capture various contextual
effects.

Based on the results of our experiments, we may tentatively conclude
that the auditory-based representations are preferred. Furthermore,
the use of acoustic attributes can significantly reduce run-time
computation for vowel classification with no cost to accuracy.
Finally, the introduction of an intermediate representation based on
distinctive features can potentially provide us with a flexible
framework to describe contextual variations and make more effective
use of training data, again at no cost to classification performance.